PREACHES - Portable Recovery and Checkpointing in Heterogeneous Systems

نویسندگان

  • Kuo-Feng Ssu
  • W. Kent Fuchs
چکیده

Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on the same type of machine and operating system, has been studied extensively. As heterogeneous distributed systems become pervasive, it is desirable to extend the capability of checkpointing to non-homogeneous environments. This paper describes a prototype, PREACHES, that achieves portable checkpointing of single process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation technique generates machine-dependent checkpoints for each different architecture in the heterogeneous environment. When failure occurs, the failed process can be restarted on a specified machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES on a heterogeneous network of workstations has been successfully developed based on TCP/IP communication. PREACHES also provides automatic and fast recovery for single process programs.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Portable Checkpointing and Recovery in Heterogeneous Environments

Current approaches for checkpointing and recovery assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover the failed computation on a different processor architecture, with possibly different byte-ordering and data-alignment specifications. This implies...

متن کامل

A portable and adaptable fault tolerance solution for heterogeneous applications

Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications, allowing them to survive fail-stop failures in the host CPU or in any of the accelerators used...

متن کامل

CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications

With the evolution of high-performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipul...

متن کامل

An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems

This paper proposes a new approach to rollback-recovery for mobile-agent systems, and describes its implementation in the MESSENGERS mobile agents system. The used checkpointing method allows to implement space and time efficient, user-transparent rollback-recovery in heterogeneous distributed environments. Together with an efficient non-blocking system snapshot algorithm this checkpointing met...

متن کامل

An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems

This paper presents an index based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint or recovery line The algorithm is based on an equivalence relation de ned between pairs of successive checkpoints of a process which allows in some cases to advance the...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998